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^ ■ Abstract 
< 

OO I We consider principal component analysis (PCA) in decomposable Gaussian graphical models. We 

exploit the prior information in these models in order to distribute its computation. For this purpose, 
^ ^ i we reformulate the problem in the sparse inverse covariance (concentration) domain and solve the 

global eigenvalue problem using a sequence of local eigenvalue problems in each of the cliques of the 
decomposable graph. We demonstrate the application of our methodology in the context of decentralized 
^ ', anomaly detection in the Abilene backbone network. Based on the topology of the network, we propose 

an approximate statistical graphical model and distribute the computation of PCA. 

> 

. I. Introduction 

cn 

^ . We consider principal component analysis (PCA) in Gaussian graphical models. PCA is a classical 

OO ■ 

O ' dimensionality reduction method which is frequently used in statistics and machine learning [1 1], [1]. The 

OO : 

O ■ first principal components of a multivariate are its orthogonal linear combinations which preserve most of 



the variance. In the Gaussian case, PCA has special properties which make it especially favorable: it is the 
best linear approximation of the data and it provides independent components. On the other hand, Gaussian 
graphical models, also known as covariance selection models, provide a graphical representation of the 
conditional independence structure within the Gaussian distribution [16], [7]. Exploiting the extensive 
knowledge and literature on graph theory, graphical models allow for efficient distributed implementation 
of statistical inference algorithms, e.g., the well known belief propagation method and the junction tree 
algorithm [20], [13]. In particular, decomposable graphs, also known as chordal or triangulated graphs, 
provide simple and intuitive inference methods due to their appealing structure. Our main contribution is 
the application of decomposable graphical models to PCA which we nickname DPCA, where D denotes 
both Decomposable and Distributed. 
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The main motivation for distributed PCA is decentralized dimensionality reduction. It plays a leading 
role in distributed estimation and compression theory in wireless sensor networks [23], [19], [21], [9], 
[18], and decentralized data mining techniques [14], [2], [17]. It is also used in anomaly detection in 
computer networks [15], [6], [12]. In particular, [9], [18] proposed to approximate the global PCA using 
a sequence of conditional local PCA solutions. Alternatively, an approximate solution which allows a 
tradeoff between performance and communication requirements was proposed in [12] using eigenvalue 
perturbation theory. 

DPCA is an efficient implementation of distributed PCA based on a prior graphical model. Unlike the 
above references it does not try to approximate PCA, but yields an exact solution up to on any given 
tolerance. On the other hand, it assumes additional prior knowledge in the form of a graphical model 
which previous works did not take into account. Although, it is interesting to note that the Gauss Markov 
source example in [9], [18] is probably the most celebrated decomposable graphical model. Therefore, we 
now address the availability of such prior information. In general, practical applications do not necessarily 
satisfy any obvious conditional independence structure. In such scenarios, DPCA can be interpreted as 
an approximate PCA method that allows a tradeoff between accuracy and decentrahzation by introducing 
sparsity. In other problems it is reasonable to assume that an unknown structure exists and can be learned 
from the observed data using existing methods such as [3], [8], [22]. Alternatively, a graphical model 
can be derived from non-statistical prior knowledge on the specific application. An intuitive example is 
distributed networks in which the topology of the network suggests a statistical graph as exploited in [5]. 
Finally, we emphasize that even if a prior graphical model is available, it does not necessarily satisfy a 
decomposable form. In this case, a decomposable approximation can be obtained using classical graph 
theory algorithms [13]. 

PCA can be interpreted as maximum likelihood (ML) estimation of the covariance using the available 
data followed by its eigenvalue decomposition. When a prior graphical model is available, PCA can still 
be easily obtained by adjusting the ML estimation phase to incorporate the prior conditional independence 
structure using existing methods [16], [7], and then computing the eigenvalue decomposition (EVD). The 
drawback to this approach is that it does not exploit the structure of the graph in the EVD phase. This 
disadvantage is the primary motivation to DPCA which is specifically designed to utihze the structure 
of Gaussian graphical models. Decomposable covariance selection models result in sparse concentration 
(inverse covariance) matrices which can be estimated in a decentralized manner. Therefore, we propose 
to reformulate DPCA in the concentration domain and solve the global EVD using a sequence of local 
EVD problems in each of the cliques of the decomposable graph with a small amount of message passing. 
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This allows for distributed implementation according to the topology of the graph and reduces the need 
to collect all the observed data in a centraUzed processing unit. When the algorithm terminates, each 
clique obtains its own local version of the principal components. 

To illustrate DPCA we apply it to distributed anomaly detection in computer networks [15], [12]. In 
this context, DPCA learns a low dimensional model of the normal traffic behavior and allows for simple 
outher detection. This application is natural since the network's topology provides a physical basis for 
constructing an approximate a graphical model. For example, consider two nodes which are geographically 
distant and linked only through a long path of nodes. It is reasonable to believe that these two sensors are 
independent conditioned on the path, but a theoretical justification of this assertion is difficult and depends 
on the specific problem formulation. We examine the vahdity of this claim in the context of anomaly 
detection in the Abilene network using a real-world dataset. We propose an approximate decomposition 
of the Abilene network, enable the use of DPCA and obtain a fully distributed anomaly detection method. 

The outline of the paper is as follows. Decomposable graphs are easy to explain using a special graph 
of two cliques which is their main building block. Therefore, we begin in section 11 by introducing the 
problem formulation and solution to DPCA in this simple case. The generalization to decomposable 
graphs is presented in section m which consists of their technical definitions followed by a recursive 
apphcation of the two cliques solution. We demonstrate the use of DPCA using two numerical examples. 
First, in Section IV we simulate our proposed algorithm in a synthetic tracking scenario. Second, in 
Section V we illustrate its apphcation to anomaly detection using a real-world dataset from the Abilene 
backbone network. Finally, in Section VI we provide concluding remarks and address future work. 

The following notation is used. Boldface upper case letters denote matrices, boldface lower case letters 
denote column vectors, and standard lower case letters denote scalars. The superscripts (•)^ and (•)^^ 
denote the transpose and matrix inverse, respectively. The cardinality of a set a is denoted by \a\. The 
matrix I denotes the identity, eig^jin (X) is the minimum eigenvalue of square symmetric matrix X, 
UnuU (X) is a null vector of X, eig^ax (^) is the maximum eigenvalue of X, and X >- means that X 
is positive definite. Finally, we use indices in the subscript [x]^ or [X]^ ^ to denote sub- vectors or sub- 
matrices, respectively, and [X]^ . denotes the sub-matrix formed by the a'th rows in X. Where possible, 
we omit the brackets and use or X^ instead. 

II. Two CLIQUE DPCA 

In this section, we introduce DPCA for a simple case which will be the building block for the general 
algorithm. 
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Fig. 1. Graphical model with two cUques modeling a 3 node network in which a and b are conditionally independent given c. 



A. Problem Formulation 

T 



Let X 



T T T 
Xc H 



be a length p, zero mean Gaussian random vector in which [x]^ and [x]j, are 
independent conditionally on [x]^ where a, c and 6 are disjoint subsets of indices. For later use, we 
use graph terminology and define two cliques of indices Ci = {a, c} and C2 = {c, b} coupled through 
the separator S = {c} (see Fig. 1). We assume that the covariance matrix of x is unknown, but the 
conditional independence structure (defined through index sets Ci and C2) is known. 

The input to DPCA is a set of n independent and identically distributed realizations of x denoted by Xj 
for i = 1, • ■ ■ , n. More specifically, this input is distributed in the sense that the first cUque has access to 
[xj]^^ for z = 1, ■ ■ ■ , n, whereas the second cUque has access only to [xj](^^ for i = 1, • • • , n. Using this 
data and minimal message passing between the two cliques, DPCA searches for the linear combination 
X = u^x having maximal variance. When the algorithm terminates, each of the cUques obtains its own 
local version of u, i.e., the sub-vectors [u]^^ and [uj^^^. 

The following subsections present the proposed solution to DPCA. It involves two main stages: 
covariance estimation and principal components computation. 



B. Solution: covariance matrix estimation 

First, the covariance matrix of x is estimated using the maximum UkeUhood (ML) technique. Due to 
the known conditional independence structure, the ML estimate has a simple closed form solution which 
can be computed in a distributed manner (more details about this procedure can be found in [16]). Each 
clique and the separator computes their own local sample covariance matrices 

s^^'^^ = ^ENcW^^ (1) 

1=1 

1 " 

SCC. ^ -Y.[^^]cA^^fc. (2) 
1=1 
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(3) 



where the tilde and the superscripts are used to emphasize that these are local estimates. Similarly, the 
local concentration matrices, also known as precision matrices, are defined as 

^ (s^^'^^)~' (4) 
-^C2,c^ = (s'^^''^^y^ (5) 

(6) 

where it is assumed that the matrices are non-singular (otherwise, the ML estimate does not exist). Next, 
the global ML concentration matrix K is obtained by requiring 



K 



a.b 



K 



b,a 







(7) 



due to the conditional independence of Xa and x;, given Xc. The global solution is 



K 













+ 








KC2,C2 








K^"^ 




(8) 



It is easy to see that the sub-matrices associated with the cliques are perturbations of to their local 
versions: 





K 



\K] 



Ci,Ci 



K 



Ci,C: 



K 



C2 ,C2 



+ 



Mb 



Ma 







(9) 



(10) 



and require only message passing via M;, and M^. 



Mb = 


J^C2,C2 






(11) 






s,s 




Ma = 




s,s 




(12) 







The dimension of these messages is equal to IS*! which is presumably small. Thus, the global ML 
concentration matrix can be easily found in a distributed manner. 

The global covariance estimate is simply defined as the inverse of its concentration S = K~^. It is 
consistent with the local estimates of its sub-matrices: 



(13) 
(14) 
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but there is no special intuition regarding its [S]^ and [S]^ ^ sub-blocks. 

C. Solution: first principal eigenvalue 
Given the global ML covariance estimate S, the PCA objective function is estimated as 

u^Su, (15) 

which is maximized subject to a norm constraint to yield 

{maxu u^Su 
(16) 
s.t. u^u = 1. 

This optimization gives both the maximal eigenvalue of S and the its eigenvector u. 

The drawback to the above solution is that the EVD computation requires centrahzed processing and 
does not exploit the structure of K. Each chque needs to send its local covariance to a central processing 
unit which constructs S and computes its maximal eigenvalue and eigenvector. We will now provide 
an alternative distributed DPCA algorithm in which each clique uses only local information along with 
minimal message passing in order to calculate its local version of eigma^ (S) and u. 

Our first observation is that DPCA can be equivalently solved in the concentration domain instead of 
the covariance domain. Indeed, it is well known that 

when the inverse K = exists. The corresponding eigenvectors are also identical. The advantage of 
working with K instead of S is that we can directly exploit K's sparsity as expressed in (7). 

Before continuing it is important to address the question of singularity of S. One may claim that 
working in the concentration domain is problematic since S may be singular. This is indeed true but 
is not a critical disadvantage since graphical models allow for well conditioned estimates under small 
sample sizes. For example, classical ML exists only if n>p, whereas the ML described above requires 
the less stringent condition n > max{|Ci|, IC2I} [16]. In fact, the ML covariance is defined as the inverse 
of its concentration, and thus the issue of singularity is an intrinsic problem of ML estimation rather than 
the DPCA solution. 

We now return to the problem of finding 

A = eig^i„(K) (18) 
in a distributed manner. We begin by expressing A as a trivial hne-search problem: 

A = sup t s.t. t < eiguiin (K) (19) 
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and note that the objective is Unear and the constraint set is convex. It can be solved using any standard 
line-search algorithm, e.g. bisection. At first, this representation seems useless as we still need to evaluate 
eigjjjjji (K) which was our original goal. However, the following proposition shows that checking the 
feasibility of a given t can be done in a distributed manner. 

Proposition 1: Let K be a symmetric matrix with = K^^ = 0. Then, the constraint 



t < eig„ 
























is equivalent to the following pair of constraints 














t < eigmin ^Kci,Ci - 






M{t) 


J 



with the message matrix defined as 

M(t)=Ke,6(Kb,6-tI)-'Kb,e. 

Proof: The proof is obtained by rewriting (20) as a linear matrix inequality 


















-il :^ 







K6,6 





and decoupling this inequality using the following lemma: 

Lemma 1 (Schur's Lemma [4, Appendix A5. 5]): Let X be a symmetric matrix partitioned as 

A B 
C 

Then, X :^ if and only if A :^ and C - B^A-^B >- 0. 
Applying Schur's Lemma to (24) with A = Kc^ and rearranging yields 



X 



tl -< Kb,b 





M{t) 

Finally, (21) and (22) are obtained by rewriting (26) and (27) as eigenvalue inequalities, respectively. 



(20) 



(21) 
(22) 



(23) 



(24) 



(25) 



(26) 
(27) 
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Proposition 1 provides an intuitive distiibuted solution to (19). For any given t we can check the 
feasibihty by solving local eigenvalue problems and message passing via M {t) whose dimension is 
equal to the cardinaUty of the separator. The optimal global eigenvalue is then defined as the maximal 

globally feasible t . 

We note that the solution in Proposition 1 is asymmetric with respect to the cliques. The global 
constraint is replaced by two local constraints regarding clique Ci = {a, c} and the remainder {b}. 
Alternatively, we can exchange the order and partition the indices into {a} and C2 = {c,b}. This 
asymmetry will become important in the next section when we extend the results to general decomposable 
graphs. 

D. Solution: first principal eigenvector 

After we obtain the minimal eigenvalue A, we can easily recover its corresponding eigenvector u. For 
this purpose, we define Q = K — AI and obtain u = Unuu (Q). The matrix Q follows the same block 
sparse structure as K, and the linear set of equations Qu = can be solved in a distributed manner. 
There are two possible solutions. Usually, Q^^ is non-singular in which case the solution is 



u 









\ 





















M 


/ 



c ' 



where the message M is defined as 
Otherwise, if Q^^b is singular then the solution is simply 

= Unull {^b,b) 



(28) 
(29) 

(30) 

(31) 
(32) 



This singular case is highly unhkely as the probabihty of (31) in continuous models is zero. However, it 
should be checked for completeness. 



E. Solution: higher order components 

In practice, dimensionality reduction involves the projection of the data into the subspace of a few 
of the first principal components. We now show that the algorithm in II-C can be extended to provide 
higher order components. 
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The j'th principal component is defined as the hnear transformation which is orthogonal to the 
preceding components and preserves maximal variance. Similarly to the first component it is given 
by Xj = ujx where Uj is the j'th principal eigenvector of S. In the concentration domain, Uj is the 

eigenvector associated with Xj, the j'th smallest eigenvalue of K. 

In order to distribute the computation of Xj, we adjust (19) using the following lemma: 

Lemma 2: Let K be a symmetric matrix with eigenvalues Ai <,•••,< Ap and eigenvectors ui, ■ ■ ■ , Up. 

Then, 

Xj = sup t s.t. t < eig^in K + ^ ViUiuJ . (33) 
{vi}izl,t \ i=i J 

The optimal Vi are any values which satisfy Vi > Xj — Xi for i = 1, ■ ■ ■ , j — 1. 

Proof: The proof is based on the recursive variational characterization of of the fth smallest 
eigenvalue^: 

minu U"^Ku 

s.t. u^u = 1 (34) 
u^Ui = 0, i = l,--- ,j -1 
where are the preceding eigenvectors, and the optimal solution Uj = u is the eigenvector associated 
with Xj. A dual representation can be obtained using Lagrange duahty. We rewrite the orthogonahty 
restrictions as quadratic constraints u-^Ujuf u = and eliminate them using Lagrange multipliers: 



A,- > max mmt + u^ 



K - tl + ^ ViUiu] 



u (35) 



i=l 

where the inequaUty is due to the weak duality [4]. The inner minimization is unbounded unless 

i-i 

K - il + ^ ViUiuf y 0. (36) 

i=l 

Therefore, 

/ i-i \ 

Xj > max_ s.t. t < eigmin ^ + ^ ^^^u^uf (37) 

t,MU \ i=i J 

Lagrange duahty does not guaranty an equahty in (37) since (34) is not convex. However, it is easy to 
see that the inequality is tight and can be attained by choosing u = Uj. Finally, (33) is obtained by 
replacing the maximum with a supremum and relaxing the constraint. ■ 

'There is also a non-recursive characterization known as Courant-Fischer theorem which results in a similar maximin 
representation [10]. 
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Lemma 2 allows us to find \j in a distributed manner. We replace K with K = K + UDU-^ where 
U is a p X (j — 1) matrix with the preceding eigenvectors as its columns and D is a (j — 1) x (j — 1) 
diagonal matrix with sufficiently high constants on its diagonal, and search for its principal component. 
The matrix K does not necessarily satisfy the sparse block structure of K so we cannot use the solution 
in Proposition 1 directly. Fortunately, it can be easily adjusted since the modification to K is of low rank. 

Proposition 2: Let K be a symmetric matrix with = K^^ = 0. Then, the constraint 



t < eig„ 












Kc,c 












+ UDU^ 



(38) 



is equivalent to the following pair of constraints 

t < eig„i„(K;,,6+[U],.D[U]g 



U 



D 



U 



where 



U 



D 




I 


D 



Mu (t) 



(39) 
(40) 

(41) 
(42) 



and the message matrix Mu (t) is defined as 



Mu(t) = 



K 



c,6 



(Kfe,6 + [U],.D[U]J^-tl) ' 



Km [U],^D 



(43) 



Proof: The proof is similar to that of Proposition 1 and therefore omitted. ■ 
Thus, the solution to the j'th largest eigenvalue is similar to the method in Section II-C. The only 

difference is that the messages are slightly larger. Each message is a matrix of size l^l +j — 1 x l^l +j — 1. 

In practice, dimensionality reduction involves only a few principal components and this method is efficient 

when \S\ + j — 1 is considerably less than p (the size of the messages in a fully centralized protocol). 
The higher order components can therefore be found in a distributed manner as detailed in Section 

II-D above. 
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III. DPCA IN DECOMPOSABLE GRAPHS 

We now proceed to the general problem of DPCA in decomposable graphs. In the previous section, 
we showed that DPCA can be computed in a distributed manner if it is a priori known that and x;, are 
conditionally independent given Xc. Graphical models are intuitive characterizations of such conditional 
independence structures. In particular, decomposable models are graphs that can be recursively subdivided 
into the two cliques graph in Fig. 1. Therefore, this section consists of numerous technical definitions 
taken from [16] followed by a recursive application of the previous results. 

An undirected graph ^ is a set of nodes coimected by undirected edges. A random vector x satisfies the 
Markov property with respect to G, if for any pair of non-adjacent nodes the corresponding pair of random 
variables are conditionally independent on the rest of the elements in x. In the Gaussian distribution, this 
definition results in sparsity in the concentration domain. If K is the concentration matrix of a jointly 
Gaussian multivariate x that satisfies Q, then [K]-^ = for any pair {i,j} of non-adjacent nodes. 

Decomposable graphs are a specific type of graph which possess an appealing structure. A graph is 
decomposable if it can be recursively be subdivided into disjoint sets of nodes a, b and c, where c 
separates a and b, and c is complete, i.e., there are no edges between a and b and all the nodes within c 
are coimected by an edge. Clearly, the simplest non-trivial decomposable graph is the two cliques graph 
in Fig. 1. 

A clique is a maximal subset of nodes which is fully coimected. It is convenient to represent a 
decomposable graph using a sequence of cUques Ci , ■ ■ ■ , Ck which satisfy a perfect elimination order. 
An important property of this order is that Sj separates Hj-i\Sj from Rj where 

Hj = Ci U C2 U • • • U Cj, j = l,---,K (44) 

Sj = Hj_,nCj, j = 2,---,K (45) 

Rj = Hj\Hj_u j = 2,---,K. (46) 

Note that this perfect elimination order induces an inherent asymmetry between the cliques which will 
be used in our recursive solution below. The two cUques graph in Fig. 1 is a simple example of a 
decomposable graph with Ci = {a,c}, C2 = {c,b}, S2 = {c}. Hi = {a,c}, H2 = {a,c,b} and 
R2 = {b}. Accordingly, S2 = {c} separates Hi\S2 = {a} from C2\S2 = {b}. 

Similarly to the previous section, global ML estimation of the concentration matrix in decomposable 
Gaussian graphical model has a simple closed form. It can be computed in a distributed manner: 

K ^ 

K = Y^ [k^'"^'=] - [k^'"^"] (47) 

k=l k=2 
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where the local estimates are defined as: 

frCC ^ j^gCC^ -1 ^ k = l,---,K (48) 
^s„s^ = (sS^,s,y^ ^ A; = 2, • • • , if, (49) 

and 

sC.A = l^[x.]^jx,]5^^ ^ = 1'---'^ (50) 

1=1 
1 " 

S^"^'' = -EN5jx&> A; = 2,-..,if. (51) 

i=l 

The zero fill-in operator [ j^ in (47) outputs a matrix of the same dimension as K where the argument 
occupies the appropriate sub-block and the rest of the matrix has zero valued elements (See (8) for a 
two clique example, and [16] for the exact definition of this operator). 

DPCA can be recursively implemented by using the previous two clique solution. Indeed, Proposition 
1 shows that the eigenvalue inequality 

t < eig^i, (K) (52) 
is equivalent to two adjusted local eigenvalue inequalities 

* < eig^i„ (K'^, (t)) (53) 
t < eig^i„ (k'^^_^ (i)) (54) 

where 

K'^Jt) = Kr,,r, (55) 

K'H^_At) = KH^.,,H..At)-[Mk{t)f. (56) 

where Mj. (t) is a message as in (23) and [-J^ is the zero fill-in operator. Next, we can apply Schur's 
Lemma again and replace (54) with two additional inequalities: 

t < eig^i, (K^j^ (0) (57) 
t < eig^i, (k'^^_ Jt)) (58) 
t < eig^i, (K'i,^_ Jt)) (59) 

where K^^^ {t) and K^^^ {t) are similarly defined. We continue in an iterative fashion until we obtain 
K decoupled eigenvalue inequalities. Thus, the feasibihty of a given t can be checked in a distributed 
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manner with minimal message passing between the chques, and any line-search can efficiently solve 
DPCA. 

Specifically, in Algorithm 1 displayed below we provide a pseudo code for DPCA that solves for t 
using the bisection method. Given initial bounds 

L < eig^i„ (K) < U, (60) 

Algorithm 1 is guaranteed to find the minimal eigenvalue up to any required tolerance e within Iog2 ^^-j^ 
iterations. Each iteration consists of up to K — 1 messages through the matrices (t) whose dimensions 
are equal to the cardinaUties of Sk ior k = 2, K . A simple choice for the bounds is L = since K 
is positive definite, and 

^ = u ?™^{6igmm (Kc„cJ} (61) 
k=l,---,K 

as proved in the Appendix. 

Given a principal eigenvalue A, its corresponding eigenvector can be computed by solving Qu = 
where Q = K — AI as detailed in Section II-D. Begimung with k = K v/e partition into Rk and 
Hk-i and test the singularity of Qr^^r^. If it is singular, then A is associated with Rk. Otherwise, we 
send the message (A) to Hk-i and repartition it. We continue until we find the associated remainder 
Rk or reach the first clique. Then, we compute the corresponding local null vector and begin propagating 
it to the higher remainders as expressed in (29). A pseudo code of this method is provided in Algorithm 
2 below. 

Algorithm 1 can be easily extended to compute higher order eigenvalues through application of 
Proposition 2. For this purpose, note that the inequality in (40) has the same structure as (38) and 
therefore can be recursively partitioned again. The only difference is that the rank of the modification is 
increased at each chque and requires larger message matrices. Thus, the algorithm is efficient as long as 
the size of the separators (l-Sfel), the number of cliques (K) and the number of required eigenvalues (j) 
are all relatively small in comparison to p. Given any eigenvalue (first or high order). Algorithm 2 finds 
the associated eigenvector in a distributed and efficient manner 

IV. Synthetic tracking example 

We now illustrate the performance of DPCA using a synthetic numerical example. Specifically, we use 
DPCA to track the first principle component in a slowly time varying setting. We define a simple graphical 
model with 305 nodes representing three fully connected networks with only 5 coupling nodes, i.e., Ci = 
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Algorithm 1: Bisection line searcli for DPCA 
Input: K, L, U, e, clique tree structure 

Output: t 

while U — L > € do 

t = {U + L)/2 

Q = K 

for A; = , 2 do 

if * < eigmin(Qii.,ijJ then 

else 

U = t 

break loop 
end 

end 

\iU >t then 

if t < eigjnin (Qci,Ci) then 
I L = t 

else 

I U = t 

end 

end 

end 



{1, • • • , 100, 301, • • • , 305}, C2 = {101, • • • , 200, 301, • • • , 305}, and C3 = {201, • • • , 300, 301, • • • , 305}. 
We generate 5500 length p = 305 vectors Xj of zero mean, unit variance and independent Gaussian 
random variables. At each time point, we define K through (47)-(51) using a shding window of n = 500 
reaUzations with 400 samples overlap. Next, we run DPCA using Algorithm 1 . Due to slow time variation, 
we define the lower (L) and upper {U) bounds as the value of the previous time point minus and plus 0.1, 
respectively. We define the tolerance as e = 0.001 corresponding to 8 iterations. Figure 2 shows the exact 
value of the minimal eigenvalue as a function of time along with its DPCA estimates at the 4'th, 6'th and 
8'th iterations. It is easy to see that a few iterations suffice for tracking the maximal eigenvalue at high 
accuracy. Each iteration involves three EVDs of approximately 105 x 105 matrices and communication 
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Algorithm 2: Eigenveclor computalion via Qu = 
Input: Q, clique tree structure 

Output: u 

u = 
Q = K 

k = K 

while (/c > 1) & {Qr^.b,,, non singular) do 

Qsk,Sk = Qsk,Sk - 
k = k-l 

end 

u(Cfc) = Unuii(Qc„cJ 

for k = k + 1, - ■ ■ ,K do 

I u(i?fe) = -QRlR^QR,,s,u{Sk) 

end 




Fig. 2. Iterations of the DPCA bisection line-search in a time varying scenario. 

through two messages of size 5x5. For comparison, a centrahzed solution would require sending a 
set of 100 length 305 vectors to a central processing unit which computes an EVD of a matrix of size 
305 X 305. 
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V. Application to distributed anomaly detection in networks 

A promising application for DPCA is distributed anomaly detection in computer networks. In this 
context, PCA is used for learning a low dimensional model for normal behavior of the traffic in the 
network. The samples are projected into the subspace associated with the first principal components. 
Anomalies are then easily detected by examining the residual norm. Our hypothesis is that the connectivity 
map of the network is related to its statistical graphical model. The intuition is that two distant Unks 
in the network are (approximately) independent conditioned on the Unks connecting them and therefore 
define a graphical model. We do not rigorously support this claim but rather apply it in a heuristic manner 
in order to illustrate DPCA. 

Following [15], [12], we consider a real world dataset of Abilene, the Intemet2 backbone network. 
This network carries traffic from universities in the United States. Figure 3 shows its connectivity map 
consisting of 11 routers and 41 links (each edge corresponds to two links and there are additional links 
from each of the nodes to itself). Examining the network it is easy to see that the links on the east 
and west sides of the map are separated through six coupling hnks: DNVR-KSCY, SNVA-KSCY and 
LOSA-HSTN. Thus, our first approximated decomposable graph, denoted by ^2ciiques> consists of two 
cUques: an eastern cUque and a western clique coupled by these six links. Graph ^2ciiques corresponds 
to a decomposable concentration matrix with a sparsity level of 0.33. Our second decomposable graph 
denoted by ^sciiques is obtained by redividing the eastern cUque again into two cUques separated through 
the four coupling links: IPLS-CHIN and ATLA-WASH. Its corresponding concentration matrix has a 
sparsity level of 0.43. Finally, for comparison we randomly generate an arbitrary graph t/random over the 
Abilene nodes, with an identical structure as ^scUqucs (three cliques of the same cardinalities), which is 
not associated with the topology of the Abilene network. 

In our experiments, we learn the 41 x 41 covariance matrix from a 41 x 1008 data matrix representing 
1008 samples of the load on each of the 41 Abilene Unks during April 7-13, 2003. We compute PCA 
and project each of the 1008 samples of dimension 41 into the null space of the first four principal 
components. The norm of these residual samples is plotted in the top plot of Fig. 4. It is easy to 
see the spikes putatively associated with anomalies. Next, we examine the residuals using DPCA with 
^2ciiqucs» ^Sciiqucs and ^/random- The norms of the residuals are plotted in the three lower plots of Fig. 4., 
respectively. As expected, the topology based plots are quite similar with spikes occurring at the times 
of these anomalies. Thus, we conclude that the decomposable graphical model for Abilene is a good 
approximation and does not cause substantial loss of information (at least for the purpose of anomaly 
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detection). On the other hand, the residual norm using the random graph is a poor approximation as it does 
not preserve the anomalies detected by the full non-distributed PCA. These conclusions are supported in 
Fig. 5 where we show the absolute errors of DPCA with respect to PCA using the different graphical 
models. It is easy to see that ^2ciiques results in minimal error, ^scUques provides a reasonable tradeoff 
between performance and computational complexity (through its increased sparsity level), while graph 
^random is clearly a mismatched graphical model and results in significant increase in error. 

VI. Discussion and future work 

In this paper, we introduced DPCA and derived a decentralized method for its computation. We 
proposed distributed anomaly detection in communication networks as a motivating apphcation for DPCA 
and investigated possible graphical models for such settings. 
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Fig. 5. Absolute error in projection into anomaly subspace with different graphical models. 



Future work should examine the statistical properties of DPCA. From a statistical perspective, DPCA 
is an extension of classical PCA to incorporate additional prior information. Thus, it would be interesting 
to analyze the distribution of its components and quantify their significance, both under the true graphical 
model and under mismatched models. In addition, DPCA is based on the intimate relation between the 
inverse covariance and the conditional Gaussian distribution. Therefore, it is also important to assess its 
sensitivity to non-Gaussian sources. Finally, alternative methods to ML in singular and ill conditioned 
scenarios should be considered. 
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Appendix 

In this appendix, we prove that the minimal eigenvalue of a p x p symmetric matrix K is less than 
or equal to the minimal eigenvalue of any of its sub-matrices, say ^ for some set of indices a. For 
simplicity, we assume that a = {I,-- - ,pa} for some integer pa < p. The proof is a simple apphcation 
of the Rayleigh quotient characterization of the minimal eigenvalues: 

eigmin (K) = mjn (62) 
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where v is the optimal solution to (65). 
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